Extended Abstract


The need for mechanisms that provide service survivability (i.e., seamless restoration of networking services
affected by random failures) in battlefield networks is both obvious and critical. This paper addresses the
fundamental challenge of providing service survivability via efficient fault localization and self-healing
techniques in the Army’s Future Battlefield Networks, examples of which include the FCS, WIN-T and
Objective Force Nets. More specifically, we develop efficient fault diagnosis techniques and self-healing
mechanisms in order to both rapidly and accurately pinpoint the root cause of a problem and to provide
uninterrupted services amidst unforeseen failures in the ad-hoc battlefield environment.

While the problem of fault localization and self-healing in packet networks is itself complex and challenging, the
complexity and challenges are compounded in battlefield networks by a combination of: (a) the random and
ad-hoc nature of battlefield networks coupled with a mobile infrastructure and (b) the presence of
stochastic (soft) failures coupled with multiple simultaneous failures. To this end, as part of a multi-year
research task under the Army Research Laboratory (ARL) Collaborative Technology Alliance (CTA) 6.1
program, we have researched and developed rapid and accurate fault localization algorithms and dynamic
multi-layer self-healing mechanisms. We provided a set of preliminary results at the 23rd Army Science
Conference [1]. In this paper, we have enhanced the localization algorithms to cover a far richer set of conditions,
including performance amidst spurious and erroneous fault symptoms, incomplete information, and the
novel use of “positive” information. Additionally, we have expanded the self-healing mechanisms to take into
account “cross-layer” information and to utilize “higher” layers both to provide service survivability against the
class of soft (stochastic) failures and to proactively thwart impending “soft” failures. In the remainder of this
extended abstract, we provide details on the fault localization and self-healing work.

With regard to failure diagnosis, we observe that traditional fault diagnosis methodologies fall severely
short in their applicability to battlefield networks because they focus on handling faults in the lower
(physical and data link) layers and are limited to handling single failures that are predominantly of the
“hard” (i.e., deterministic/equipment-related) type. In battlefield networks, by contrast, fault diagnosis
cannot be constrained to the lower layers of the protocol stack. Additionally, due to the unpredictable
nature of the network, a majority of failures are of the “soft” (i.e., stochastic/performance-related) type
and there is a high probability of multiple simultaneous soft failures.
We have therefore done the following: (a) Developed novel fault and alarm models that are sensitive to the
unique features of battlefield networks, including a multi-layer model that uses Bayesian techniques to capture
the dependencies that may exist between entities in multiple network nodes and in multiple protocol layers at
those nodes, and that can capture information pertaining to both deterministic and non-deterministic faults; (b)
Designed new fault correlation algorithms that operate on the multi-layer model to perform failure diagnosis.
These include a Bayesian algorithm that uses a belief network representation of the fault model, and an
incremental algorithm that processes symptoms incrementally as they are received by the fault manager; and (c)
Conducted detailed simulation studies of the incremental multi-layer fault correlation algorithm and showed
that it can scale to at least 100 nodes and still execute fast enough to be deployed in real time. We
have also conducted studies of how the operation of our algorithm is affected by noise in the network, which
may result in lost and spurious symptoms, and by uncertainty in the probabilistic information in the
multi-layer model. As one example, Figure 1 shows the effect of spurious symptoms on the detection rate.
Other detailed results will be included in the full paper.
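
The idea of processing symptoms incrementally against a probabilistic fault model can be illustrated with a
minimal sketch. It is not the algorithm evaluated in this work: the fault names, symptom names, and
probabilities are hypothetical, the two-level fault-to-symptom table stands in for the multi-layer model, and
the noisy-OR scoring and the handling of unknown symptoms are simplifying assumptions.

```python
# Hypothetical causal model: P(symptom is observed | fault is present).
CAUSES = {
    "link_A_B_loss": {"tcp_timeout_A_C": 0.8, "route_flap_B": 0.6},
    "node_B_congested": {"tcp_timeout_A_C": 0.5, "queue_alarm_B": 0.9},
}
# Hypothetical prior probability of each fault.
PRIOR = {"link_A_B_loss": 0.05, "node_B_congested": 0.10}


def explains(hypothesis, symptom):
    """True if at least one fault in the hypothesis can cause the symptom."""
    return any(symptom in CAUSES[f] for f in hypothesis)


def update(hypotheses, symptom):
    """Extend every hypothesis so that it explains the new symptom."""
    updated = set()
    for h in hypotheses:
        if explains(h, symptom):
            updated.add(h)
        else:
            for fault, effects in CAUSES.items():
                if symptom in effects:
                    updated.add(h | {fault})
    return updated


def score(hypothesis, observed):
    """Unnormalized belief: fault priors times noisy-OR symptom likelihoods."""
    p = 1.0
    for fault, prior in PRIOR.items():
        p *= prior if fault in hypothesis else 1.0 - prior
    for symptom in observed:
        miss = 1.0
        for fault in hypothesis:
            miss *= 1.0 - CAUSES[fault].get(symptom, 0.0)
        p *= 1.0 - miss
    return p


def diagnose(symptom_stream):
    """Process symptoms one at a time and return the best-scoring hypothesis."""
    hypotheses = {frozenset()}
    observed = []
    for symptom in symptom_stream:
        if not any(symptom in effects for effects in CAUSES.values()):
            continue  # simplification: ignore symptoms unknown to the model
        observed.append(symptom)
        hypotheses = update(hypotheses, symptom)
    return max(hypotheses, key=lambda h: score(h, observed))
```

With these illustrative numbers, the stream `["tcp_timeout_A_C", "queue_alarm_B"]` yields the single-fault
hypothesis `{"node_B_congested"}`, while `["route_flap_B", "queue_alarm_B"]` can only be explained by both
faults together, which is the multiple-simultaneous-failure case discussed above.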


1 Prepared through collaborative participation in the Collaborative Technology Alliance (CTA) Communications and Networks (C&N)
sponsored by the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAD19-01-2-0011.
© Telcordia Technologies Inc, 2004 and University of Delaware.
Figure 1: Effect of Spurious Symptoms on the Detection Rate


With regard to self-healing mechanisms, we observe that, once again, the widely used self-healing mechanisms
for telecommunications networks cannot be used as-is, since they (like their fault diagnosis counterparts)
are geared mainly to accommodate “hard/deterministic” failures. Furthermore, they
mainly provide restoration via the use of “backup” equipment dedicated solely to restoration. These
are severe shortcomings in the battlefield environment, since the failures in battlefield
networks are predominantly “soft/non-deterministic” and battlefield resources are expensive. Hence, we have
designed self-healing mechanisms that cater to the battlefield dynamics with the following key features: (a)
multi-layer healing whereby survivability can be triggered simultaneously across different layers (e.g., limited
layer 1, in combination with layer 3 and/or layer 4) to handle multiple simultaneous failures, (b) use of
cross-layer information to cater to different survivability requirements imposed by the wide spectrum of
applications utilizing the battlefield network (e.g., mission critical applications vs. non-critical), (c) ability
to function either reactively (by responding to a failure that has already occurred) or proactively (by
responding to “high watermark” performance-related threshold crossings), (d) ability to provide protection
against malicious network elements by triggering an isolation or network reconfiguration, and (e) use of a
policy framework that can invoke self-healing at the appropriate layer or a combination of layers based on the
properties of a failure event; for example, mission critical applications may be provided layer 1 restoration
while non-real-time but loss-sensitive battlefield image transfers may be restored via layer 3 or layer 4
healing. We have also implemented portions of the proposed self-healing mechanisms in QualNet. As one example,
Figures 2.a and 2.b show a self-healing action that is triggered in response to a “bad” server node.
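
Features (c) and (e) above can be pictured as a small policy lookup. The sketch below is only an illustration
under stated assumptions: the class labels, the `Event` fields, the single-metric watermark test, and the
fallback layer are all hypothetical, and only the two table entries mirror the example given in the text.

```python
from dataclasses import dataclass

# Hypothetical policy table: application class -> layers at which to heal.
POLICY = {
    "mission_critical": (1,),
    "loss_sensitive_bulk": (3, 4),
}
DEFAULT_LAYERS = (3,)  # assumption: fallback for classes not in the table


@dataclass
class Event:
    app_class: str
    failed: bool           # a failure has already occurred
    metric: float          # monitored performance metric, e.g. loss ratio
    high_watermark: float  # threshold that signals an impending soft failure


def select_healing(event):
    """Return (mode, layers), or None when no healing action is needed."""
    if event.failed:
        mode = "reactive"
    elif event.metric >= event.high_watermark:
        mode = "proactive"
    else:
        return None
    return mode, POLICY.get(event.app_class, DEFAULT_LAYERS)
```

For instance, `select_healing(Event("loss_sensitive_bulk", False, 0.12, 0.1))` returns
`("proactive", (3, 4))`, because the metric has crossed its watermark although no failure has occurred yet.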


Figure  2.a:  Network  Before  Self-healing 


Figure 2.b: Network After Self-healing


The self-healing (SH) scenario in Figure 2 (a, b) is as follows. Figure 2.a represents the initial configuration of a
subnet in the battlefield, wherein the node “CS” is the current server (example “services” include naming,
location, bandwidth brokering, etc.) and nodes SS1 and SS2 are “server capable” nodes that have not been
instantiated as servers in the current mission configuration (based on mission requirements). As the mission
progresses, CS becomes inaccessible to a portion of the nodes (exhibited by poor service to those nodes). The SH
mechanism is triggered, which in turn instantiates CS2 to be the server for a portion of the nodes. The choice of
which server-capable node to instantiate as another server (i.e., SS1 or SS2) is made by the SH mechanism
based on node capabilities and in consultation with the service management layer (SML).
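
One way to picture the server selection step is sketched below. The selection criteria are assumptions made
for illustration only: the text states that node capabilities and the SML are consulted, but the capability
score, the reachability sets, and the tie-breaking rule here are hypothetical.

```python
def choose_standby(candidates, affected_nodes, sml_approved):
    """Pick the SML-approved standby that reaches the most affected nodes.

    candidates maps a standby name to (capability score, set of reachable nodes).
    Ties on coverage are broken by the higher capability score.
    """
    best = None
    best_key = None
    for name, (capability, reachable) in candidates.items():
        if name not in sml_approved:
            continue  # the (hypothetical) SML has not cleared this node
        key = (len(reachable & affected_nodes), capability)
        if best_key is None or key > best_key:
            best, best_key = name, key
    return best
```

The function returns `None` when no standby is approved, which leaves the decision to whatever fallback the
surrounding SH logic provides.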

Both fault diagnosis and self-healing require information about the network topology, which is provided by the
topology management function. We have designed a cluster-based topology management scheme that extends the
scope of the topology information collected from immediate neighbors to multi-hop neighbors within the same
local group, so that the most suitable nodes can be selected to host distributed management functionality.
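
Extending a node's view from immediate neighbors to multi-hop neighbors within its own group can be sketched
as a hop-limited breadth-first search. This is an illustrative sketch rather than the scheme designed in this
work; the adjacency map, the `cluster_of` assignment, and the `max_hops` bound are hypothetical inputs.

```python
from collections import deque


def cluster_view(adjacency, cluster_of, start, max_hops):
    """Return {node: hop count} for same-cluster nodes within max_hops of start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops and cluster_of[neighbor] == cluster_of[start]:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops
```

A node could then use the returned hop counts as one input when ranking candidates to host a management
function, for example by preferring nodes that are few hops from most cluster members.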

Reference: 

[1] L. Kant, A. Sethi, M. Steinder, “Fault Localization and Self-Healing Mechanisms for FCS Networks”,
published in the proceedings of the 23rd Army Science Conference, December 2002. Also a recipient of a Best
Paper Award.


